A genetic algorithm for text mining
Abstract
People who work with text need ways of representing huge amounts of it in a more compact form. Textual documents can be represented by concepts. One way to define the concepts is by terms: keywords extracted from the documents and cleaned by processes such as stopword removal and stemming. Using the frequencies of the terms, one can quantify the relations between documents or portions of text. These relations can serve many applications, such as information retrieval or automatic text classification. Another way to define the concepts is by sets of correlated terms rather than by raw terms. Correlated terms usually have a more specific meaning. Finding meaningful concepts within huge corpora in a reasonable timeframe is a difficult task. This paper describes a new text mining process to uncover interesting term correlations. The process uses a genetic algorithm to cope with the combinatorial explosion of the term sets. The genetic algorithm identifies combinations of terms that optimize an objective function, which is the cornerstone of the process. We tested a function designed to optimize the discriminating power of the term sets. The genetic model was tested on a TREC sub-collection, with the parameters set to discover a thousand combinations of correlated terms. These term sets were then added to the basic index and applied to the information retrieval problem. The experiment revealed that the augmented index did not improve retrieval effectiveness when compared with the vector space model.
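The search for term sets described in the abstract can be illustrated with a small sketch. The abstract does not give the paper's objective function, operators, or parameters, so everything below is an assumption made for illustration: the toy corpus `DOCS`, the `fitness` score (co-occurrence count weighted by an IDF-like factor, standing in for "discriminating power"), the truncation selection with elitism, and all parameter values.

```python
import math
import random

# Toy corpus (hypothetical): each document is a set of terms that have
# already been through stopword removal and stemming.
DOCS = [
    {"genetic", "algorithm", "text", "mining"},
    {"genetic", "algorithm", "seismic", "inversion"},
    {"text", "mining", "retrieval", "index"},
    {"text", "classification", "feature", "selection"},
    {"retrieval", "index", "vector", "space"},
    {"genetic", "disease", "data", "mining"},
]
VOCAB = sorted(set().union(*DOCS))


def fitness(term_set):
    """Assumed objective, not the paper's: df * log(N / df), where df is the
    number of documents containing every term of the set. It is zero for sets
    that never co-occur and for sets that occur in every document."""
    df = sum(1 for doc in DOCS if term_set <= doc)
    if df == 0:
        return 0.0
    return df * math.log(len(DOCS) / df)


def crossover(rng, a, b, size):
    # Child draws its terms from the union of both parents.
    return frozenset(rng.sample(sorted(a | b), size))


def mutate(rng, term_set, rate):
    # With probability `rate`, swap one term for a term outside the set.
    if rng.random() >= rate:
        return term_set
    dropped = rng.choice(sorted(term_set))
    added = rng.choice([t for t in VOCAB if t not in term_set])
    return frozenset((set(term_set) - {dropped}) | {added})


def evolve(size=2, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    population = [frozenset(rng.sample(VOCAB, size)) for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [ranked[0]]             # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = crossover(rng, a, b, size)
            children.append(mutate(rng, child, mutation_rate))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = evolve()
    print(sorted(best), round(fitness(best), 3))
```

Because the best individual is copied unchanged into each new generation, the best fitness never decreases from one generation to the next. A realistic run would replace `DOCS` with an indexed collection and `fitness` with whatever objective function is being studied.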
Similar resources
A Technique for Improving Web Mining using Enhanced Genetic Algorithm
World Wide Web is growing at a very fast pace and makes a lot of information available to the public. Search engines used conventional methods to retrieve information on the Web; however, the search results of these engines are still able to be refined and their accuracy is not high enough. One of the methods for web mining is evolutionary algorithms which search according to the user interests...
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...
A new stochastic 3D seismic inversion using direct sequential simulation and co-simulation in a genetic algorithm framework
Stochastic seismic inversion is a family of inversion algorithms in which the inverse solution was carried out using geostatistical simulation. In this work, a new 3D stochastic seismic inversion was developed in the MATLAB programming software. The proposed inversion algorithm is an iterative procedure that uses the principle of cross-over genetic algorithms as the global optimization techniqu...
Mining Interesting Aspects of a Product using Aspect-based Opinion Mining from Product Reviews (RESEARCH NOTE)
As the internet and its applications are growing, E-commerce has become one of its fastest-growing applications. Customers of E-commerce were provided with the opportunity to express their opinion about the product on the web as a text in the form of reviews. In the previous studies, merely finding the sentiment of reviews was not helpful to get the exact opinion of the review. In this paper, we have used A...
Designing an intelligent system for predicting chromosomal genetic diseases using data mining
Background and Aim: Today we are witnessing tremendous advances in medical data mining. The data, by analyzing and discovering the relationships between them, can lead to algorithms that help us prevent or treat many diseases. Meanwhile, genetic diseases have attracted a large part of the attention of the medical world because the birth of children with genetic disorders imposes a great financi...
Using a combination of genetic algorithm and particle swarm optimization algorithm for GEMTIP modeling of spectral-induced polarization data
The generalized effective-medium theory of induced polarization (GEMTIP) is a newly developed relaxation model that incorporates the petro-physical and structural characteristics of polarizable rocks in the grain/porous scale to model their complex resistivity/conductivity spectra. The inversion of the GEMTIP relaxation model parameter from spectral-induced polarization data is a challenging is...
Publication date: 2005